The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. This is a historic moment, scientifically and analytically: a large amount of data is being shared across the globe and analyzed in real time, and some interesting questions may fall out of it. The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always 'ahead' of Our World In Data. I also noticed that JHU's website was buggy: clicking on the U.S. stats did not actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be why the discrepancy exists. Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in metrics like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Sun Jun 14 22:06:00 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the 'analysis' proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Sun Jun 14 22:06:00 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name description
prediction_model outputs case estimate for given log-linear model parameters (slope and intercept)
make_long converts input data to long format (specialized cases)
name_overlaps outputs the column-name intersection and set differences of two data frames
find_linear_index finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  # log-linear model: log10(total cases) = m*days + b
  total_cases.log<-m*days+b
  total_cases<-10^total_cases.log
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
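As a self-contained sketch of the log-linear prediction idea (the slope and intercept below are invented for illustration, not coefficients fitted from the data):

```r
# Log-linear growth: log10(cases) = m*days + b, so cases = 10^(m*days + b)
m <- 0.1   # hypothetical slope (log10 cases per day)
b <- 2     # hypothetical intercept (log10 cases at day 0)
days <- 0:3
predicted_cases <- 10^(m * days + b)
round(predicted_cases, 1)
## 100.0 125.9 158.5 199.5
```

With these numbers, cases grow roughly 26% per day; doubling the slope would square the daily growth factor, which is why the fit has to be done in log space.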

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces standardization/assumes data structure naming by using fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                    value.name = "Total_confirmed_cases",
                    id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){
  long_data<-melt(data_in,
                  id.vars = id.vars,
                  variable.name=variable.name,
                  value.name=value.name)
  return(long_data)
}
##------------------------------------------
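A toy wide-to-long illustration of what make_long() does (invented counts; shown with base R instead of reshape2::melt so it stands alone):

```r
# Toy wide table: one row per region, one mangled date column per day
wide <- data.frame(Country.Region = c("A", "B"),
                   X1.22.20 = c(0, 5),
                   X1.23.20 = c(1, 8))
# Stack the date columns: one row per (region, date) pair
long <- data.frame(Country.Region = rep(wide$Country.Region, times = 2),
                   Date = rep(c("X1.22.20", "X1.23.20"), each = 2),
                   Total_confirmed_cases = c(wide$X1.22.20, wide$X1.23.20))
nrow(long)
## 4
```

Each date column in the wide table becomes a block of rows in the long one, which is the shape ggplot and ddply expect downstream.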

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count second-derivative points within 0.2 of zero over a trailing
  # window of `running_avg` days
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[(t-running_avg+1):t,"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Back up `running_avg` days from the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
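The detection idea can be sketched on a synthetic series (toy numbers, not the JHU data): the growth rate is the first difference of log2(cases); once growth settles into a constant exponential rate, the second difference stays near zero:

```r
# Synthetic series: daily growth factor decays, then holds at 1.15
growth <- c(3, 2.2, 1.7, 1.4, rep(1.15, 10))
cases <- 100 * cumprod(growth)
d1 <- diff(log2(cases))   # first derivative of log2(cases) = log2(growth factor)
d2 <- diff(d1)            # second derivative; ~0 once growth is log-linear
abs(d2) < 0.2             # FALSE while the rate is still changing, then TRUE
```

find_linear_index() applies the same test but requires `running_avg` consecutive near-zero points before declaring the series stable, which guards against one quiet day in a still-changing region.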

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.13.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.13.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.13.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.13.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# Population is listed in the US deaths file but not in the other datasets. When combining, every dataset needs the column, but only US deaths carries a useful value.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
# JHU uses an awkward file format: wide, with one column per date in MM/DD/YY. Read it in as raw data but transform it to be better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # read.csv prefixes numeric column names with "X"; swap it for a leading 0
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # expand the trailing two-digit year ".20" to ".2020"
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
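The two gsub() steps on a single mangled column name, assuming read.csv's usual "X" prefix for names that start with a digit:

```r
d <- "X6.13.20"                                      # read.csv mangles "6/13/20"
d <- gsub(d, pattern = "^X", replacement = "0")      # "06.13.20"
d <- gsub(d, pattern = "20$", replacement = "2020")  # "06.13.2020"
as.Date(d, format = "%m.%d.%y")                      # parses to 2020-06-13
```

Note %y matches the first two year digits and as.Date ignores the trailing characters, so the parse works either way; the expansion just makes the intermediate strings unambiguous.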

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 144 144
Albania 144 144
Algeria 144 144
Andorra 144 144
Angola 144 144
Antigua and Barbuda 144 144
Argentina 144 144
Armenia 144 144
Australia 1152 1152
Austria 144 144
Azerbaijan 144 144
Bahamas 144 144
Bahrain 144 144
Bangladesh 144 144
Barbados 144 144
Belarus 144 144
Belgium 144 144
Belize 144 144
Benin 144 144
Bhutan 144 144
Bolivia 144 144
Bosnia and Herzegovina 144 144
Botswana 144 144
Brazil 144 144
Brunei 144 144
Bulgaria 144 144
Burkina Faso 144 144
Burma 144 144
Burundi 144 144
Cabo Verde 144 144
Cambodia 144 144
Cameroon 144 144
Canada 2016 2016
Central African Republic 144 144
Chad 144 144
Chile 144 144
China 4752 4752
Colombia 144 144
Comoros 144 144
Congo (Brazzaville) 144 144
Congo (Kinshasa) 144 144
Costa Rica 144 144
Cote d’Ivoire 144 144
Croatia 144 144
Cuba 144 144
Cyprus 144 144
Czechia 144 144
Denmark 432 432
Diamond Princess 144 144
Djibouti 144 144
Dominica 144 144
Dominican Republic 144 144
Ecuador 144 144
Egypt 144 144
El Salvador 144 144
Equatorial Guinea 144 144
Eritrea 144 144
Estonia 144 144
Eswatini 144 144
Ethiopia 144 144
Fiji 144 144
Finland 144 144
France 1584 1584
Gabon 144 144
Gambia 144 144
Georgia 144 144
Germany 144 144
Ghana 144 144
Greece 144 144
Grenada 144 144
Guatemala 144 144
Guinea 144 144
Guinea-Bissau 144 144
Guyana 144 144
Haiti 144 144
Holy See 144 144
Honduras 144 144
Hungary 144 144
Iceland 144 144
India 144 144
Indonesia 144 144
Iran 144 144
Iraq 144 144
Ireland 144 144
Israel 144 144
Italy 144 144
Jamaica 144 144
Japan 144 144
Jordan 144 144
Kazakhstan 144 144
Kenya 144 144
Korea, South 144 144
Kosovo 144 144
Kuwait 144 144
Kyrgyzstan 144 144
Laos 144 144
Latvia 144 144
Lebanon 144 144
Lesotho 144 144
Liberia 144 144
Libya 144 144
Liechtenstein 144 144
Lithuania 144 144
Luxembourg 144 144
Madagascar 144 144
Malawi 144 144
Malaysia 144 144
Maldives 144 144
Mali 144 144
Malta 144 144
Mauritania 144 144
Mauritius 144 144
Mexico 144 144
Moldova 144 144
Monaco 144 144
Mongolia 144 144
Montenegro 144 144
Morocco 144 144
Mozambique 144 144
MS Zaandam 144 144
Namibia 144 144
Nepal 144 144
Netherlands 720 720
New Zealand 144 144
Nicaragua 144 144
Niger 144 144
Nigeria 144 144
North Macedonia 144 144
Norway 144 144
Oman 144 144
Pakistan 144 144
Panama 144 144
Papua New Guinea 144 144
Paraguay 144 144
Peru 144 144
Philippines 144 144
Poland 144 144
Portugal 144 144
Qatar 144 144
Romania 144 144
Russia 144 144
Rwanda 144 144
Saint Kitts and Nevis 144 144
Saint Lucia 144 144
Saint Vincent and the Grenadines 144 144
San Marino 144 144
Sao Tome and Principe 144 144
Saudi Arabia 144 144
Senegal 144 144
Serbia 144 144
Seychelles 144 144
Sierra Leone 144 144
Singapore 144 144
Slovakia 144 144
Slovenia 144 144
Somalia 144 144
South Africa 144 144
South Sudan 144 144
Spain 144 144
Sri Lanka 144 144
Sudan 144 144
Suriname 144 144
Sweden 144 144
Switzerland 144 144
Syria 144 144
Taiwan* 144 144
Tajikistan 144 144
Tanzania 144 144
Thailand 144 144
Timor-Leste 144 144
Togo 144 144
Trinidad and Tobago 144 144
Tunisia 144 144
Turkey 144 144
Uganda 144 144
Ukraine 144 144
United Arab Emirates 144 144
United Kingdom 1584 1584
Uruguay 144 144
US 144 144
US_state 469584 469584
Uzbekistan 144 144
Venezuela 144 144
Vietnam 144 144
West Bank and Gaza 144 144
Western Sahara 144 144
Yemen 144 144
Zambia 144 144
Zimbabwe 144 144
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case occurred for each Country/Province. NOTE: non-US countries may have odd province structure. For example, France is summarized at the country level but also has several provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
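A toy version of the alignment (invented counts): find the first numeric date with more than 100 cases, then re-index every date relative to it:

```r
toy <- data.frame(Date.numeric = 18300:18305,  # invented numeric dates
                  Total_confirmed_cases = c(20, 60, 150, 400, 900, 2000))
case100_date <- min(toy$Date.numeric[toy$Total_confirmed_cases > 100])
toy$Days_since_100 <- toy$Date.numeric - case100_date
toy$Days_since_100
## -2 -1  0  1  2  3
```

This puts every region on a common x-axis (day 0 = crossing 100 cases), which is what makes the later cross-country trend plots comparable.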

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 5485
Alaska 1090
Arizona 1346
Arkansas 5805
California 5130
Colorado 4967
Connecticut 799
Delaware 335
Diamond Princess 89
District of Columbia 90
Florida 5790
Georgia 13069
Grand Princess 90
Guam 90
Hawaii 462
Idaho 2654
Illinois 7660
Indiana 7573
Iowa 7180
Kansas 6058
Kentucky 8626
Louisiana 5450
Maine 1389
Maryland 2105
Massachusetts 1337
Michigan 6734
Minnesota 6391
Mississippi 6763
Missouri 7698
Montana 2374
Nebraska 4647
Nevada 1071
New Hampshire 950
New Jersey 2015
New Mexico 2356
New York 5135
North Carolina 8036
North Dakota 2881
Northern Mariana Islands 75
Ohio 7149
Oklahoma 5526
Oregon 2753
Pennsylvania 5606
Puerto Rico 90
Rhode Island 532
South Carolina 3927
South Dakota 3650
Tennessee 7756
Texas 16755
Utah 1227
Vermont 1275
Virgin Islands 90
Virginia 10192
Washington 3502
West Virginia 3808
Wisconsin 5498
Wyoming 1711
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question dataset x y color facet pch dimensions
comparative_longitudinal_case_trend long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical regsion, logcase vs day (100th case)
    # output: m, b for each geographical region ID
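A sketch of that per-region fit on toy data (in the real analysis, x would be Days_since_100 and y Total_confirmed_cases.log; the counts below are invented):

```r
# Toy log10 case counts that grow perfectly linearly per region
toy <- data.frame(region = rep(c("A", "B"), each = 5),
                  day = rep(0:4, times = 2),
                  log_cases = c(2 + 0.30 * (0:4), 2 + 0.15 * (0:4)))
# One lm() per region; coef() returns c(intercept b, slope m)
fits <- lapply(split(toy, toy$region),
               function(d) coef(lm(log_cases ~ day, data = d)))
fits$A[["day"]]   # slope m for region A
## 0.3
```

The slope m is the log10 daily growth rate, so comparing m across regions compares doubling times directly.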



#total/death on same plot - they differ by 2 logs, so when plotting log, use pch; when plotting absolute, need to use free scales
#when plotting death and cases on the same plot, melt.

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 58870
China 144
Diamond Princess 125
Korea, South 115
Japan 114
Italy 112
Iran 109
Singapore 106
France 105
Germany 105
Spain 104
US 102
Switzerland 101
United Kingdom 101
Belgium 100
Netherlands 100
Norway 100
Sweden 100
Austria 98
Malaysia 97
Australia 96
Bahrain 96
Denmark 96
Canada 95
Qatar 95
Iceland 94
Brazil 93
Czechia 93
Finland 93
Greece 93
Iraq 93
Israel 93
Portugal 93
Slovenia 93
Egypt 92
Estonia 92
India 92
Ireland 92
Kuwait 92
Philippines 92
Poland 92
Romania 92
Saudi Arabia 92
Indonesia 91
Lebanon 91
Pakistan 91
San Marino 91
Thailand 91
Chile 90
Luxembourg 89
Peru 89
Russia 89
Ecuador 88
Mexico 88
Slovakia 88
South Africa 88
United Arab Emirates 88
Armenia 87
Colombia 87
Croatia 87
Panama 87
Serbia 87
Taiwan* 87
Turkey 87
Argentina 86
Bulgaria 86
Latvia 86
Uruguay 86
Algeria 85
Costa Rica 85
Dominican Republic 85
Hungary 85
Andorra 84
Bosnia and Herzegovina 84
Jordan 84
Lithuania 84
Morocco 84
New Zealand 84
North Macedonia 84
Vietnam 84
Albania 83
Cyprus 83
Malta 83
Moldova 83
Brunei 82
Burkina Faso 82
Sri Lanka 82
Tunisia 82
Ukraine 81
Azerbaijan 80
Ghana 80
Kazakhstan 80
Oman 80
Senegal 80
Venezuela 80
Afghanistan 79
Cote d’Ivoire 79
Cuba 78
Mauritius 78
Uzbekistan 78
Cambodia 77
Cameroon 77
Honduras 77
Nigeria 77
West Bank and Gaza 77
Belarus 76
Georgia 76
Bolivia 75
Kosovo 75
Kyrgyzstan 75
Montenegro 75
Congo (Kinshasa) 74
Kenya 73
Niger 72
Guinea 71
Rwanda 71
Trinidad and Tobago 71
Paraguay 70
Bangladesh 69
Djibouti 67
El Salvador 66
Guatemala 65
Madagascar 64
Mali 63
Congo (Brazzaville) 60
Jamaica 60
Gabon 58
Somalia 58
Tanzania 58
Ethiopia 57
Burma 56
Sudan 55
Liberia 54
Maldives 52
Equatorial Guinea 51
Cabo Verde 49
Sierra Leone 47
Guinea-Bissau 46
Togo 46
Zambia 45
Eswatini 44
Chad 43
Tajikistan 42
Haiti 40
Sao Tome and Principe 40
Benin 38
Nepal 38
Uganda 38
Central African Republic 37
South Sudan 37
Guyana 35
Mozambique 34
Yemen 30
Mongolia 29
Mauritania 26
Nicaragua 26
Malawi 20
Syria 20
Zimbabwe 18
Bahamas 17
Libya 17
Comoros 15
Suriname 7
Angola 4
# Pick the top countries with data (capped by the palette size)
max_colors<-12
# TODO: find a way to fix this - China has different provinces, so the plot doesn't look right
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 144
Korea, South 115
Japan 114
Italy 112
Iran 109
Singapore 106
France 105
Germany 105
Spain 104
US 102
Switzerland 101
United Kingdom 101
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-05-25        18407
## 2    Afghanistan           <NA> <NA> 2020-05-28        18410
## 3    Afghanistan           <NA> <NA> 2020-01-22        18283
## 4    Afghanistan           <NA> <NA> 2020-05-22        18404
## 5    Afghanistan           <NA> <NA> 2020-05-27        18409
## 6    Afghanistan           <NA> <NA> 2020-03-05        18326
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                    219                 11173     0.01960082
## 2                    235                 13036     0.01802700
## 3                      0                     0            NaN
## 4                    205                  9216     0.02224392
## 5                    227                 12456     0.01822415
## 6                      0                     1     0.00000000
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  4.048170                   2.340444        18348
## 2                  4.115144                   2.371068        18348
## 3                      -Inf                       -Inf        18348
## 4                  3.964542                   2.311754        18348
## 5                  4.095379                   2.356026        18348
## 6                  0.000000                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             59  NA   NA         NA                           NA
## 2             62  NA   NA         NA                           NA
## 3            -65  NA   NA         NA                           NA
## 4             56  NA   NA         NA                           NA
## 5             61  NA   NA         NA                           NA
## 6            -22  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)
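The do.call() idiom above builds one shared argument list and swaps in each data frame before calling melt. In miniature, with a toy function instead of melt:

```r
# Shared arguments built once; the data slot is replaced per call
arg_list <- list(digits = 1)
arg_list$x <- pi
do.call(round, arg_list)
## 3.1
arg_list$x <- exp(1)
do.call(round, arg_list)
## 2.7
```

This keeps the variable.name/value.name/measure.vars settings in one place, so all three long data frames are guaranteed to share the same shape.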


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
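The write_plot helper is defined upstream and not shown in this section. A minimal stand-in, assuming only the (plot, wd) signature and the printed file paths seen in the output above, might look like:

```r
# Hypothetical sketch of write_plot (the real helper is defined upstream):
# save a ggplot as <variable name>.png under wd and print the file path.
write_plot<-function(plot_obj,wd){
  plot_name<-deparse(substitute(plot_obj))  # name the file after the variable
  outfile<-file.path(wd,paste0(plot_name,".png"))
  ggsave(filename = outfile,plot = plot_obj)
  print(outfile)
}
```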
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:
  1. Find the time at which cases vs. date becomes linear in each plot
  2. Fit a linear model for each city

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the state list: swap New York out for Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)
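find_linear_index is defined upstream; a plausible sketch of the idea, assuming it scans each state's series for the point where the second difference (curvature) of total cases stays near zero, could be:

```r
# Hypothetical sketch of find_linear_index (the real helper is defined
# upstream). Returns the numeric date after which a state's total cases grow
# roughly linearly, i.e. the second difference falls below a chosen tolerance
# (the tolerance value here is an assumption).
find_linear_index<-function(state_data,tolerance=50){
  state_data<-arrange(state_data,Date)
  curvature<-diff(diff(state_data$Total_confirmed_cases_perstate))
  idx<-which(abs(curvature)<tolerance)[1]
  if(is.na(idx)) idx<-1
  as.numeric(state_data$Date[idx])
}
```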

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")
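prediction_model is likewise defined upstream; given how it is called here (slope m, intercept b, a vector of numeric days), a minimal sketch, with the returned column names assumed from the rbind above, could be:

```r
# Hypothetical sketch of prediction_model (the real helper is defined
# upstream): evaluate the fitted line m*day + b at each requested day.
prediction_model<-function(m,b,days){
  data.frame(days=days,
             Total_confirmed_cases_perstate=m*days+b)
}
```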

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 cases based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable? What about delays?

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
ddply(tmp,c("state"))
lm(data =tmp,formula = y~x )

# Fit a per-group linear model and return the slope (m) and intercept (b) for each group
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # Rename "(Intercept)" and the predictor's coefficient to b and m
  # (plyr::rename takes c(old_name = new_name))
  predictor<-all.vars(formula_input)[2]
  case_models.parameters<-plyr::rename(case_models.parameters,
                                       setNames(c("b","m"),c("(Intercept)",predictor)))
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))
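The merge above joins the two tables on their shared keys after harmonizing column names; a toy example of the rename-then-merge pattern (data invented for illustration):

```r
# Toy illustration of the rename-then-merge pattern used above (invented data)
cases<-data.frame(Province.State="Maryland",City="Baltimore City",n_cases=10)
mob<-data.frame(State="Maryland",County="Baltimore City",Workplace=-35,
                stringsAsFactors = FALSE)
# plyr::rename takes c(old_name = new_name); merge then joins on the
# shared Province.State and City columns
merged<-merge(cases,plyr::rename(mob,c("State"="Province.State","County"="City")))
```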

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people (Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Grocery_Pharmacy 0.9949467 -34.0
Hawaii Retail_Recreation 0.9733692 -56.0
New Hampshire Parks 0.9516530 -20.0
Vermont Parks 0.9334610 -35.5
Maine Transit -0.9153749 -50.0
Hawaii Parks 0.8901205 -72.0
Connecticut Grocery_Pharmacy -0.8870401 -6.0
Utah Residential -0.8813452 12.0
Hawaii Transit 0.8496478 -89.0
Utah Transit -0.8364013 -18.0
South Dakota Parks 0.8140692 -26.0
Arizona Grocery_Pharmacy -0.7803667 -15.0
Wyoming Parks -0.7738879 -4.0
Rhode Island Workplace -0.7648509 -39.5
Alaska Workplace -0.7554064 -33.0
Connecticut Transit -0.7536491 -50.0
Massachusetts Workplace -0.7511964 -39.0
Alaska Grocery_Pharmacy -0.7440462 -7.0
Arizona Residential 0.6568105 13.0
Vermont Grocery_Pharmacy -0.6547465 -25.0
New York Workplace -0.6522009 -34.5
Utah Parks -0.6301072 17.0
North Dakota Parks 0.6287940 -34.0
Nevada Transit -0.6272339 -20.0
Rhode Island Retail_Recreation -0.6264114 -45.0
Alaska Residential 0.6133742 13.0
New Jersey Parks -0.6044452 -6.0
Rhode Island Residential -0.6030669 18.5
Maine Workplace -0.5919932 -30.0
New York Retail_Recreation -0.5897372 -46.0
Nebraska Workplace 0.5826283 -32.0
Utah Workplace -0.5730188 -37.0
Arizona Retail_Recreation -0.5658537 -42.5
Idaho Residential -0.5656164 11.0
New Jersey Workplace -0.5393125 -44.0
New York Parks 0.5296739 20.0
Hawaii Residential -0.5273505 19.0
Connecticut Retail_Recreation -0.5202164 -45.0
Connecticut Residential 0.5126750 14.0
Massachusetts Retail_Recreation -0.4953482 -44.0
Kansas Parks 0.4932410 72.0
Maine Parks 0.4869264 -31.0
New Jersey Grocery_Pharmacy -0.4856657 2.5
New Mexico Grocery_Pharmacy -0.4813338 -11.0
Connecticut Workplace -0.4769649 -39.0
West Virginia Parks 0.4707675 -33.0
Montana Parks -0.4697588 -58.0
Iowa Parks -0.4685786 28.5
Nebraska Residential -0.4662297 14.0
New Mexico Residential 0.4527237 13.5
Rhode Island Parks 0.4526207 52.0
Maryland Workplace -0.4513739 -35.0
North Dakota Retail_Recreation -0.4462720 -42.0
Arizona Transit 0.4397605 -38.0
Arkansas Parks 0.4330209 -12.0
New Jersey Retail_Recreation -0.4319526 -62.5
Illinois Transit -0.4297233 -31.0
Wyoming Transit -0.4267305 -17.0
Pennsylvania Workplace -0.4203844 -36.0
Maryland Grocery_Pharmacy -0.4120112 -10.0
New Mexico Parks 0.4103248 -31.5
South Carolina Workplace 0.4090436 -30.0
New Jersey Transit -0.4078190 -50.5
Massachusetts Grocery_Pharmacy -0.4072699 -7.0
Vermont Residential 0.4036697 11.5
New Hampshire Residential -0.3994978 14.0
Montana Residential 0.3987967 14.0
Pennsylvania Retail_Recreation -0.3911827 -45.0
Michigan Parks 0.3899469 28.5
Alabama Workplace -0.3884750 -29.0
Kentucky Parks -0.3873871 28.5
Oregon Workplace -0.3863588 -31.0
Alabama Grocery_Pharmacy -0.3757784 -2.0
New York Transit -0.3681814 -48.0
New Mexico Retail_Recreation -0.3550218 -42.5
Oregon Parks -0.3510103 16.5
Wisconsin Transit -0.3479234 -23.5
Wyoming Workplace -0.3438342 -31.0
Nebraska Grocery_Pharmacy 0.3411368 -0.5
Idaho Workplace -0.3379420 -29.0
Idaho Grocery_Pharmacy -0.3360268 -5.5
Maryland Retail_Recreation -0.3241868 -39.0
Arkansas Retail_Recreation -0.3226260 -30.0
North Dakota Workplace 0.3213118 -40.0
California Transit -0.3183107 -42.0
Wyoming Grocery_Pharmacy -0.3178396 -10.0
Virginia Transit -0.3132472 -33.0
California Residential 0.3131862 14.0
Missouri Residential -0.3130083 13.0
Maine Retail_Recreation -0.3029801 -42.0
Alaska Retail_Recreation 0.3005294 -39.0
California Parks -0.2975197 -38.5
Idaho Transit -0.2971932 -30.0
Illinois Workplace -0.2901636 -30.5
Colorado Residential 0.2889829 14.0
Minnesota Transit -0.2884980 -28.5
Florida Residential 0.2870690 14.0
Nevada Residential 0.2836234 17.0
Pennsylvania Parks 0.2802100 12.0
Mississippi Residential 0.2798995 13.0
North Carolina Grocery_Pharmacy 0.2792034 0.0
Vermont Retail_Recreation 0.2779935 -57.0
Montana Transit 0.2745187 -41.0
South Carolina Parks -0.2712734 -23.0
North Carolina Workplace 0.2660595 -31.0
Georgia Grocery_Pharmacy -0.2645217 -10.0
Pennsylvania Grocery_Pharmacy -0.2641350 -6.0
Rhode Island Grocery_Pharmacy 0.2634588 -7.5
Alabama Transit -0.2604615 -36.5
Vermont Workplace -0.2597702 -43.0
Maryland Residential 0.2589344 15.0
Kansas Workplace 0.2570670 -32.5
Texas Workplace 0.2557705 -32.0
Rhode Island Transit -0.2528446 -56.0
Texas Residential -0.2527777 15.0
Tennessee Workplace -0.2485417 -31.0
Illinois Parks 0.2484216 26.5
Florida Parks -0.2414939 -43.0
Tennessee Residential 0.2388137 11.5
Georgia Workplace -0.2303292 -33.5
West Virginia Grocery_Pharmacy -0.2295498 -6.0
New York Grocery_Pharmacy -0.2294387 8.0
Wisconsin Parks 0.2259670 51.5
Washington Grocery_Pharmacy 0.2209276 -7.0
Hawaii Workplace 0.2201868 -46.0
Georgia Retail_Recreation -0.2191761 -41.0
South Dakota Workplace 0.2121632 -35.0
Minnesota Parks 0.2084674 -9.0
North Carolina Transit 0.2077433 -32.0
Nevada Retail_Recreation -0.2023512 -43.0
Arizona Parks -0.2016226 -44.5
North Dakota Grocery_Pharmacy -0.2013173 -8.0
West Virginia Workplace 0.2001422 -33.0
Connecticut Parks 0.1983918 43.0
Mississippi Grocery_Pharmacy -0.1969794 -8.0
Kansas Grocery_Pharmacy -0.1954093 -14.0
Missouri Workplace 0.1940457 -29.0
Montana Workplace -0.1937401 -40.0
Idaho Retail_Recreation -0.1913767 -39.5
Illinois Residential 0.1909188 14.0
Colorado Parks -0.1908221 2.0
Utah Retail_Recreation -0.1898258 -40.0
North Carolina Residential 0.1898254 13.0
Kentucky Transit -0.1896974 -31.0
Virginia Residential 0.1847617 14.0
Oregon Transit 0.1843150 -27.5
Nebraska Transit -0.1828710 -9.0
Indiana Parks -0.1821809 29.0
Wisconsin Residential -0.1782410 14.0
Texas Parks 0.1778030 -42.0
Kentucky Grocery_Pharmacy 0.1764460 4.0
Oregon Residential 0.1737891 10.5
Nevada Workplace -0.1662680 -40.0
Pennsylvania Transit -0.1660822 -42.0
New Hampshire Retail_Recreation -0.1645136 -41.0
Indiana Residential 0.1637693 12.0
Massachusetts Parks 0.1614386 39.0
New Jersey Residential 0.1612129 18.0
Arkansas Residential 0.1612043 12.0
New Mexico Transit 0.1604428 -38.5
Ohio Transit 0.1579698 -28.0
Tennessee Parks -0.1570044 10.5
Alabama Parks 0.1532501 -1.0
South Dakota Residential 0.1501998 15.0
Iowa Transit 0.1498610 -24.0
Michigan Workplace -0.1444280 -40.0
California Grocery_Pharmacy -0.1405264 -11.5
Virginia Grocery_Pharmacy -0.1404667 -8.0
Minnesota Retail_Recreation 0.1382026 -40.5
Indiana Retail_Recreation 0.1379893 -38.0
Missouri Transit -0.1343831 -24.5
North Dakota Residential -0.1336070 17.0
Mississippi Retail_Recreation -0.1293922 -40.0
Missouri Parks 0.1290443 0.0
North Dakota Transit 0.1276054 -48.0
Wisconsin Workplace -0.1246316 -31.0
Texas Grocery_Pharmacy 0.1241358 -14.0
California Retail_Recreation -0.1219539 -44.0
Oklahoma Parks 0.1212099 -18.5
Mississippi Transit -0.1185975 -38.5
Wisconsin Grocery_Pharmacy 0.1179130 -1.0
Florida Retail_Recreation 0.1171639 -43.0
North Carolina Retail_Recreation 0.1132910 -34.0
Kentucky Residential 0.1108641 12.0
Wyoming Residential 0.1107981 12.5
Idaho Parks 0.1069615 -22.0
Nebraska Retail_Recreation 0.1069000 -36.0
Arkansas Workplace -0.1041794 -26.0
Massachusetts Transit -0.1037215 -45.0
Oklahoma Grocery_Pharmacy -0.1030956 -0.5
Virginia Parks 0.1023935 6.0
Mississippi Parks -0.1017263 -25.0
Massachusetts Residential 0.1016548 15.0
New Hampshire Grocery_Pharmacy -0.1012342 -6.0
Kansas Transit -0.1009923 -26.5
Iowa Workplace 0.1009534 -30.0
Maryland Transit -0.0976388 -39.0
Michigan Transit 0.0976259 -46.0
Michigan Residential 0.0962757 15.0
Michigan Retail_Recreation -0.0955879 -53.0
South Dakota Transit -0.0940955 -40.0
Utah Grocery_Pharmacy 0.0937666 -4.0
Ohio Residential 0.0931521 14.0
Minnesota Grocery_Pharmacy 0.0921148 -6.0
New York Residential 0.0910560 17.5
Texas Transit 0.0904425 -41.0
Georgia Residential -0.0898157 13.0
Oklahoma Workplace 0.0896631 -31.0
Nevada Parks 0.0891647 -12.5
Oregon Grocery_Pharmacy 0.0890818 -7.0
Indiana Workplace 0.0880709 -34.0
Virginia Workplace -0.0867119 -31.5
Alabama Retail_Recreation 0.0866785 -39.0
Ohio Parks -0.0866265 67.5
Iowa Grocery_Pharmacy -0.0852206 4.0
Washington Workplace -0.0802255 -38.0
Georgia Parks 0.0800759 -6.0
Oregon Retail_Recreation -0.0790253 -40.5
Pennsylvania Residential 0.0769963 15.0
Minnesota Residential -0.0754044 17.0
West Virginia Transit -0.0750392 -45.0
Virginia Retail_Recreation -0.0748693 -35.0
Alabama Residential -0.0746417 11.0
Ohio Grocery_Pharmacy 0.0728443 0.0
Kentucky Retail_Recreation 0.0720589 -29.0
North Carolina Parks -0.0681077 7.0
Maine Residential -0.0675291 11.0
West Virginia Residential -0.0666063 11.0
Washington Residential 0.0652501 13.0
Minnesota Workplace -0.0647046 -33.0
Colorado Transit 0.0637221 -36.0
Texas Retail_Recreation 0.0600429 -40.0
Florida Grocery_Pharmacy 0.0583275 -14.0
Washington Transit -0.0574995 -33.5
California Workplace -0.0553351 -36.0
Vermont Transit -0.0538446 -63.0
Kentucky Workplace -0.0488470 -36.5
Ohio Retail_Recreation 0.0486883 -36.0
South Dakota Retail_Recreation -0.0483145 -39.0
Washington Parks 0.0481290 -3.5
New Hampshire Workplace 0.0478223 -37.0
Florida Transit -0.0461990 -49.0
Indiana Transit 0.0456308 -29.0
Oklahoma Residential 0.0432986 15.0
Missouri Retail_Recreation -0.0404022 -36.0
Georgia Transit -0.0391341 -35.0
Ohio Workplace -0.0373412 -35.0
Illinois Grocery_Pharmacy -0.0369977 2.0
Iowa Retail_Recreation 0.0343967 -38.0
South Carolina Transit 0.0343608 -45.0
Mississippi Workplace -0.0335317 -33.0
Wisconsin Retail_Recreation 0.0295297 -44.0
New Mexico Workplace 0.0292977 -34.0
Indiana Grocery_Pharmacy -0.0286270 -5.5
Maine Grocery_Pharmacy -0.0285544 -13.0
Colorado Grocery_Pharmacy -0.0269917 -17.0
South Carolina Grocery_Pharmacy 0.0265314 1.0
Arizona Workplace -0.0262688 -35.0
Arkansas Grocery_Pharmacy -0.0258715 3.0
Tennessee Transit 0.0253777 -32.0
Michigan Grocery_Pharmacy -0.0235744 -11.0
South Carolina Residential -0.0233822 12.0
West Virginia Retail_Recreation -0.0211835 -38.5
Illinois Retail_Recreation 0.0207434 -40.0
Kansas Residential -0.0200601 13.0
Colorado Retail_Recreation -0.0200550 -44.0
Tennessee Grocery_Pharmacy 0.0195348 6.0
Washington Retail_Recreation 0.0193448 -42.0
Montana Grocery_Pharmacy -0.0182986 -16.0
Montana Retail_Recreation 0.0180204 -50.0
Oklahoma Transit 0.0168684 -26.0
Colorado Workplace 0.0160304 -39.0
Tennessee Retail_Recreation -0.0160275 -30.0
Iowa Residential -0.0159186 13.0
New Hampshire Transit 0.0150397 -57.0
Arkansas Transit 0.0145091 -27.0
Nebraska Parks 0.0132939 55.5
Maryland Parks 0.0131254 27.0
Florida Workplace -0.0128371 -33.0
Wyoming Retail_Recreation -0.0127351 -39.0
Missouri Grocery_Pharmacy 0.0126000 2.0
Kansas Retail_Recreation -0.0069129 -38.0
South Dakota Grocery_Pharmacy -0.0056698 -9.0
Oklahoma Retail_Recreation 0.0055525 -31.0
South Carolina Retail_Recreation -0.0033269 -35.0
Nevada Grocery_Pharmacy 0.0022932 -12.5
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to github. These are provided as a convenience, but note this is a manual process. The generation of reports, plots and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
|------|--------|------|
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |
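As an aside, the manifest table could also be rendered directly with `knitr::kable` rather than pasting the columns together with `apply`/`paste`. A minimal sketch, using a two-row stand-in for the full `deliverable_manifest` (the handles and `github_root` below mirror the ones defined above):

```r
library(knitr)

github_root <- "https://github.com/sbs87/coronavirus/blob/master/"
plot_handle <- c("Corona_Cases.world.long.plot",
                 "Corona_Cases.world.loglong.plot")

deliverable_manifest <- data.frame(
  name = c("World total & death cases, longitudinal",
           "World log total & death cases, longitudinal"),
  plot_handle = plot_handle,
  link = paste0(github_root, "results/", plot_handle, ".png")
)

# kable() emits a pipe-delimited markdown table that knits cleanly
kable(deliverable_manifest, format = "markdown")
```

This avoids the manual `paste(..., sep = " | ")` step and keeps the column alignment consistent when the document is re-knit.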

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in a log-linear phase for the world or the U.S. (though some regions, such as MD, are still log-linear). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with the exact value depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

| Question | Answer |
|----------|--------|
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the strongest correlations, but the picture is mixed. |
| What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region but is about 3% overall. |
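The mortality rate quoted above is cumulative deaths divided by cumulative confirmed cases. A minimal sketch of that calculation, using dummy counts (the `Total_deaths` column name and the numbers are illustrative, not from the actual dataset; the round figures are chosen to land on the ~3% rate cited above):

```r
# Illustrative cumulative counts for two hypothetical regions
region_snapshot <- data.frame(
  Province.State        = c("Region A", "Region B"),
  Total_confirmed_cases = c(100000, 50000),
  Total_deaths          = c(3000, 1500)   # column name is hypothetical
)

# Mortality rate = cumulative deaths / cumulative confirmed (RNA-based) cases
region_snapshot$mortality_rate <- with(
  region_snapshot,
  Total_deaths / Total_confirmed_cases
)

region_snapshot$mortality_rate  # both regions: 0.03
```

Note this is a naive case-fatality ratio; it does not account for reporting lag between case confirmation and death, which is one reason the rate fluctuates within a region over time.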

END

End: ##—— Sun Jun 14 22:07:32 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
#install.packages("maps")
library(maps)
library(dplyr)
library(ggplot2)
pa_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(pa_counties)

ggplot(pa_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
pa_counties$cases <- seq_len(nrow(pa_counties))  # placeholder case counts, one per boundary point
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinates for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date == "2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)  # "join::inner_join.sf" is not a valid call; requires a shared key column

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_cast(spdf)

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net